The extant immunoglobulin superfamily, member 1 gene results from an ancestral gene duplication in eutherian mammals

Immunoglobulin superfamily, member 1 (IGSF1) is a transmembrane glycoprotein with high expression in the mammalian pituitary gland. Mutations in the IGSF1 gene cause congenital central hypothyroidism in humans. The IGSF1 protein is co-translationally cleaved into N- and C-terminal domains (NTD and CTD), the latter of which is trafficked to the plasma membrane and appears to be the functional portion of the molecule. Though the IGSF1-NTD is retained in the endoplasmic reticulum and has no apparent function, it has a high degree of sequence identity with the IGSF1-CTD and is conserved across mammalian species. Based upon phylogenetic analyses, we propose that the ancestral IGSF1 gene encoded the IGSF1-CTD, which was duplicated and integrated immediately upstream of itself, yielding a larger protein encompassing the IGSF1-NTD and IGSF1-CTD. The selective pressures favoring the initial gene duplication and subsequent retention of a conserved IGSF1-NTD are unresolved.

The full-length IGSF1 cDNA encodes a 12 immunoglobulin (Ig) loop transmembrane glycoprotein that is co-translationally cleaved at an internal signal sequence in the endoplasmic reticulum (ER) into N-terminal (NTD) and C-terminal domains (CTD) [7] (Fig 1A). Following cleavage, both domains possess Ig loops (five and seven, respectively) in the ER lumen, a single transmembrane domain, and a short cytoplasmic carboxy tail. The NTD is retained in the ER, whereas the CTD is trafficked to the plasma membrane and appears to be the functional part of the molecule [7]. This concept is supported by the fact that the vast majority of intragenic mutations in IGSF1 map to the part of the gene encoding the CTD, and these mutations inhibit the protein's plasma membrane trafficking (Fig 1B) [3,8-18]. Two disease-associated mutations in the NTD have been described thus far, but both cause frameshifts that preclude expression of the CTD [4,8,9,11-20].
At least two observations suggest that the extant IGSF1 gene is the product of gene duplication. First, the NTD and CTD have high sequence identity and domain structure similarity [21]. Second, at least in rodents, the pituitary expresses an mRNA isoform that encodes only the CTD [22]. We hypothesize that the ancestral IGSF1 gene in mammals encoded the CTD, and that this gene was duplicated during evolution and integrated immediately 5' of itself. To test the idea that the NTD arose from duplication of the CTD, we performed phylogenetic analyses of IGSF1. Next, we compared IGSF1/Igsf1 cDNA and genomic sequences across a large number of species and explored the conservation of the seemingly non-functional NTD.

Multiple sequence alignment and phylogenetic trees
Orthologs of IGSF1 and its relatives (A1BG, OSCAR, FCAR, TARM1, NCR1, LILRA5, and VSTM1) were identified in Mammalia and other amniotes using the bi-directional best hits (BBH) method. Organisms that have reference proteomes in the UniProt database (https://www.uniprot.org/proteomes/) were used for ortholog searching [23]. Orthologs of IGSF1 were split into the NTD and CTD parts of the molecules. Multiple sequence alignments of the orthologs, together with the IGSF1-NTD and -CTD regions, were made using Clustal Omega [24], with default parameters. The proteins used in our analyses are collated in S1 Raw images.
Phylogenetic trees were made using PhyML with default parameters [25], except as described in the relevant figure legends. Trees were drawn using FigTree (http://tree.bio.ed.ac.uk/software/figtree/).
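The bi-directional best hits criterion used for ortholog searching can be illustrated with a minimal sketch. This is not the pipeline used in the study: the function names, the score-table input format, and the toy protein identifiers are assumptions made for illustration, and in practice the scores would come from a sequence search tool run in both directions between two proteomes.

```python
from typing import Dict, List, Tuple

# Hypothetical input format: (query_id, subject_id) -> similarity score,
# one table per search direction.
ScoreTable = Dict[Tuple[str, str], float]


def best_hits(scores: ScoreTable) -> Dict[str, str]:
    """Return the highest-scoring subject for each query.

    Ties keep the first subject encountered (an arbitrary choice made
    for this sketch).
    """
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b: ScoreTable,
                            b_vs_a: ScoreTable) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy scores for two hypothetical proteomes, A and B.
a_vs_b = {("a1", "b1"): 900.0, ("a1", "b2"): 300.0,
          ("a2", "b2"): 500.0, ("a3", "b1"): 200.0}
b_vs_a = {("b1", "a1"): 880.0, ("b1", "a3"): 190.0,
          ("b2", "a1"): 310.0, ("b2", "a2"): 490.0}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In the toy data, a3 finds b1 as its best hit, but b1's best hit is a1, so a3 is not reported as an ortholog. The reciprocity requirement is what distinguishes BBH from a one-directional best-hit search.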

Identity and similarity of Ig loops
Amino acid sequences of individual IGSF1 Ig loops were identified from the human NCBI Reference Sequence: NP_001546.2. Loops from the NTD were compared with loops from the CTD using the NCBI BLASTP tool to determine sequence identity and similarity [26].
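As an illustration of how percent identity and similarity are derived from a pairwise alignment, the following sketch counts identical and similar residues over the aligned columns. It is a simplification and not the BLASTP implementation: BLASTP scores similarity ("positives") with a substitution matrix, whereas this sketch uses a small set of residue groups that is an assumption made here for brevity. The function name and the toy sequences are likewise hypothetical.

```python
from typing import Sequence, Tuple

# Assumed groups of chemically similar residues (a stand-in for a
# substitution matrix; not the BLASTP definition of similarity).
SIMILAR_GROUPS: Tuple[str, ...] = ("ILVM", "FWY", "KRH", "DE", "ST", "NQ")


def identity_and_similarity(aln_a: str, aln_b: str,
                            groups: Sequence[str] = SIMILAR_GROUPS
                            ) -> Tuple[float, float]:
    """Percent identity and similarity of two gapped, aligned sequences.

    Both percentages use the full alignment length (gap columns included)
    as the denominator. Identical residues also count as similar.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    if not aln_a:
        raise ValueError("alignment is empty")

    group_of = {res: i for i, grp in enumerate(groups) for res in grp}
    identical = 0
    similar = 0
    for x, y in zip(aln_a.upper(), aln_b.upper()):
        if x == "-" or y == "-":
            continue  # gap columns count toward length only
        if x == y:
            identical += 1
            similar += 1
        elif x in group_of and group_of.get(y) == group_of[x]:
            similar += 1

    columns = len(aln_a)
    return 100.0 * identical / columns, 100.0 * similar / columns


# Toy alignment (not IGSF1 sequence): 3 identities, 2 conservative
# substitutions, and 1 gap column across 6 columns.
identity, similarity = identity_and_similarity("ACDE-K", "ACEEFR")
print(f"identity={identity:.1f}%  similarity={similarity:.1f}%")
```

Because similarity includes identities, it is always at least as large as identity, which matches the pattern of the values reported for the Ig loop comparisons.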
All animal work was performed in accordance with institutional and federal guidelines under an animal use protocol (AUP5204) approved by the McGill University Facility Animal Care Committee DOW-A.

Sequence conservation between the IGSF1-NTD and -CTD
Examination of the amino acid sequence of human IGSF1 revealed remarkable identity between its NTD and CTD. Ig loops from the NTD clustered with Ig loops from the CTD in a phylogenetic tree: loop 1 with 6, loop 2 with 7, loop 3 with 9 and 11, loop 4 with 10, and loop 5 with 12, with branch support values of 1, 0.91, 0.89, 0.96, and 0.98, respectively (Fig 2). The amino acid sequence identity between the Ig loops that clustered together ranged from 44 to 63%, and the similarity ranged from 58 to 80% (Table 1). In contrast, non-clustering Ig loops had sequence identities from 23 to 44%, and similarities from 41 to 60%. Ig loop 8 in the CTD lacked a corresponding loop in the NTD (Fig 2C). Further, the branch lengths for the NTD Ig loops 2, 3, 4 and 5 were longer than for the corresponding CTD Ig loops, indicating less constraint on their sequences.

IGSF1-NTD and -CTD evolved independently
The similarity between the IGSF1-NTD and -CTD suggested that the duplication of one may have led to the emergence of the other. We performed another phylogenetic analysis, this time examining the NTD and CTD sequences across species. We found no evidence of IGSF1 in non-mammalian species and we did not observe clustering with sequences from non-eutherian amniotes. The CTDs in mammals clustered together, as did the NTDs (Fig 3). However, the NTDs and CTDs clustered separately in the tree (compare IGSF1-N and IGSF1-C in Fig 3).

IGSF1 and the LRC family
As IGSF1 was recently designated a member of the leukocyte receptor cluster (LRC) family [21], we included LRC homologs of IGSF1 in our analysis. The human IGSF1-NTD and -CTD clustered with sets of orthologs of the alpha-1-B glycoprotein (A1BG) (Fig 3). BBH orthologs of IGSF1-NTD and -CTD, as well as for A1BG, were observed only in eutheria, indicating a common ancestry during the early eutherian epoch. These proteins did not cluster with LRC homologs osteoclast-associated immunoglobulin-like receptor (OSCAR), immunoglobulin alpha Fc receptor (FCAR), T-cell-interacting, activating receptor on myeloid cells protein 1 (TARM1), natural cytotoxicity triggering receptor 1 (NCR1), leukocyte immunoglobulin-like receptor A5 (LILRA5), or V-set and transmembrane domain-containing protein 1 (VSTM1), indicating distinct ancestry. VSTM1 contains only one Ig loop, and this loop was aligned to IGSF1-NTD Ig loop 4 (loop 10 in the CTD) by Clustal Omega [24].
We performed an additional phylogenetic analysis using only the sequence encoding Ig loop 4 of IGSF1, which is the deepest part of the Clustal Omega alignment, comprising all of the LRC sequences examined (Figs 4 and 5). Here again, despite the limited sequence information, IGSF1-NTD and -CTD clustered with A1BG, albeit with a weaker branch support value of 0.6 (Fig 4). In this tree, among the other LRC members examined, OSCAR clustered most closely with IGSF1-NTD, IGSF1-CTD, and A1BG (Fig 4).
[22,28,29]. The murine Igsf1 isoform 1 (IGSF1-1) transcript encodes the full-length IGSF1 protein, including both the NTD and CTD. In contrast, the murine Igsf1 isoform 4 (IGSF1-4) transcript, which initiates in intron 9, encodes the CTD alone. When expressed in heterologous HEK293 cells, the CTDs from IGSF1-1 and IGSF1-4 were indistinguishable and comigrated with IGSF1 endogenously expressed in the murine pituitary (Fig 6).

Discussion
According to our analysis, a gene duplication event gave rise to the extant IGSF1 gene early in eutherian evolution. We propose that the ancestral IGSF1 gene encoded the IGSF1 C-terminal domain (CTD) alone. This gene was then duplicated and inserted immediately upstream (5') of itself on Chr. X, creating a novel gene that encodes a large transcript from which the N-terminal domain (NTD) and CTD are co-translationally derived. Consistent with our argument, most of the Ig loops in the NTD and CTD cluster together in a phylogenetic tree (1 and 6; 2 and 7; 3 with 9 and 11; 4 and 10; and 5 and 12). Ig loop 8 in the CTD does not cluster with other loops, suggesting that it was lost from the NTD during or after the duplication event. Ig loop 3 of the NTD shares a root with loops 9 and 11 of the CTD, with a high branch support value of 0.89. We suggest two possibilities for this similarity: 1) during the gene duplication, Ig loop 11 was additionally duplicated and inserted upstream to create Ig loop 9 (or vice versa, Ig loop 9 was duplicated and inserted downstream of itself), or 2) Ig loop 9 or 11 was earlier duplicated and loop 11 was subsequently lost from the NTD during duplication of the CTD (Fig 7). Ig loop 4 shares high sequence similarity with loops 9 and 10, though it is more similar to, and clusters with, loop 10. Additionally, loops 4/10 and 5/12 have higher similarity than other loops clustered together in the tree. This sequence similarity could indicate that these loops are particularly important for protein function. Interestingly, a high proportion of pathogenic mutations cluster in and around loops 10 and 12 (Fig 1B).
Several observations converge to support our argument that the CTD-encoding, rather than the NTD-encoding, part of IGSF1 was duplicated. First, the CTD is more complex (7 Ig loops) than the NTD (5 Ig loops) and is more highly conserved. As can be seen in the phylogram comparing Ig loops from the NTD and CTD (Fig 2C), branch lengths are longer for all but one of the Ig domains in the NTD compared to the CTD, indicating that the NTD is mutating at a faster rate at the amino acid level (similarly in Fig 3). In fact, the CTD is mutating more slowly than any member of the LRC family (Figs 3 and 4). Second, at least in rodents, there is a transcript that encodes the CTD alone (what we previously referred to as isoform 4). Transcription of isoform 4 initiates in intron 9 [22]. The open reading frame of the resulting mRNA contains a signal peptide coding sequence at its N-terminus. This signal peptide enables the CTD derived from isoform 4 to be expressed as a plasma membrane protein (Fig 6). It also provides the basis for the internal cleavage of the full-length protein into the NTD and CTD by signal peptidase [7,22] (Fig 1A). We are unaware of other cellular proteins that are processed in this manner (i.e., via an internal signal peptide). The most parsimonious explanation is that this signal peptide has maintained its ancestral function. Interestingly, the human equivalent of murine Igsf1 isoform 4 has not been reported. This suggests that transcription from what we propose to be the ancestral promoter in intron 9 has been lost in human evolution. Though present, isoform 4 is expressed at far lower levels than the full-length Igsf1 mRNA (isoform 1) in murine pituitary [22], further suggesting that the activity of this promoter may similarly be diminishing in rodents. Notably, Igsf1 knockout mice lacking exon 1 express isoform 4, but not isoform 1. Expression levels of isoform 4 in these mice are unaltered relative to wild-type, and the mice do not express the CTD at sufficient levels to functionally compensate for the loss of the CTD derived from isoform 1 [3,6]. Collectively, the data suggest that, over time, activity of the ancestral promoter (in intron 9) has been reduced or lost, with transcription initiating principally from exon 1 and the CTD being derived from the co-translational cleavage of the large precursor protein via the ancestral (internal) signal peptide. Perhaps the acquisition of a new promoter (upstream of exon 1) following the duplication event conferred a selective advantage by quantitatively or qualitatively (spatially or temporally) altering IGSF1 gene and protein expression. In this event, the acquisition of the NTD protein could be coincidental rather than advantageous.
Fig 3. Phylogenetic tree based on a Clustal Omega alignment of IGSF1 and its human paralogs and their sets of orthologs across eutheria made using PhyML default parameters, except that starting tree topology and invariable sites were optimized, and the PhyML option for using the best of the available tree searching operations was applied. Bi-directional best hits (BBH) from a selection of non-eutherian amniotes were included as outgroups. The IGSF1-NTD and -CTD regions were aligned separately. The clades for IGSF1-NTD and -CTD, A1BG, OSCAR, FCAR, TARM1, NCR1, LILRA5, and VSTM1, and 'Other Amniote' sequences are labelled on the appropriate branches, and collapsed to wedge shapes, except for the IGSF1 and A1BG clades, which are highlighted in grey. The nodes at which branches end were color-coded by PhyML aLRT branch support value as indicated by the heat map. The tree was drawn using FigTree. https://doi.org/10.1371/journal.pone.0267744.g003
Consistent with this latter idea, the NTD is retained in the ER and has no apparent cellular function [7]. As we show here (Fig 6), the NTD is not needed for the expression or plasma membrane trafficking of the CTD. Given that the protein encoded by the full-length mRNA (isoform 1) is co-translationally cleaved, the NTD at best may serve as a large, but functionless N-terminal prodomain for the CTD. Still, if this is the case, why is the NTD sequence highly conserved? This conservation suggests that the NTD may currently play or perhaps previously played a functional role, even if retained in the ER. That said, the only mutations we are aware of in the NTD that are associated with IGSF1 deficiency (or any disorder) cause frameshifts that preclude expression of the CTD (Fig 1B). Therefore, while there may be selective pressure to maintain the open reading frame of mRNA isoform 1, it is unclear why missense mutations in the NTD-encoding part of the gene are not more prevalent if the NTD is truly functionless.
Finally, as IGSF1 was recently found to be a member of the LRC family [21], we included LRC members in our phylogenetic analysis. We see some divergence between the family members. IGSF1 clusters with A1BG, with BBH orthologs of both detected only within eutheria, indicating common ancestry during eutherian evolution. It is possible that IGSF1 function, which is not currently known, is more similar to that of A1BG than to that of other LRC members. The next most similar protein in the evolutionary analysis is OSCAR (Figs 3 and 4). A1BG and OSCAR function as receptors for extracellular ligands, CRISP-3 and collagens I and III, respectively [30,31]. We therefore postulate that IGSF1 may similarly function as a receptor or binding protein for one or more extracellular ligands.

Conclusion
We contend that the ancestral IGSF1/Igsf1 gene encoded the IGSF1-CTD. During early eutherian evolution, the gene was duplicated and inserted immediately upstream of itself. The novel gene encodes a large transcript from which two proteins are derived: the IGSF1-NTD and IGSF1-CTD. As the IGSF1-CTDs from the extant and ancestral genes are likely the same (or highly similar) and the IGSF1-NTD is retained in the ER, the adaptive significance of the original gene duplication is not clear but may relate to the acquisition of novel patterns of gene expression. The conservation of the IGSF1-NTD suggests that it may have played an important role that it subsequently lost or that it has a currently unappreciated function in the ER.